GRAB SAFETY CHALLENGE - ACCT672 Machine Learning for Business

Step 1 Data Pre-processing

Load all of the data files into pandas DataFrames. There are 10 telematics CSV files and 1 label CSV file (dangerous trips are labelled as 1).
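A minimal sketch of the loading step, assuming the CSV files sit together in one folder and share the same columns. The helper name `load_csv_folder` and the folder paths in the usage comment are hypothetical, not taken from the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv_folder(folder, pattern="*.csv"):
    """Read every CSV matching `pattern` in `folder` and stack the rows."""
    paths = sorted(Path(folder).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder!r}")
    frames = [pd.read_csv(p) for p in paths]
    # ignore_index gives the combined frame one clean 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Hypothetical locations; adjust to wherever the files actually live.
# telematics = load_csv_folder("data/features")
# labels = load_csv_folder("data/labels")
```

Concatenating once at the end, rather than appending inside the loop, avoids repeatedly copying the growing frame.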

Step 2 Data Cleaning

Identify and remove conflicting trips and their observations (trips labelled both 0 and 1 at the same time).
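One way this cleaning step could look is sketched below. The column names `bookingID` and `label`, and the function name, are assumptions made for illustration. A trip counts as conflicting when its ID appears with more than one distinct label.

```python
import pandas as pd


def drop_conflicting_trips(labels, telematics, id_col="bookingID", label_col="label"):
    """Remove trips that carry more than one distinct label, plus their observations."""
    # Number of distinct labels per trip; more than 1 means a conflict.
    n_labels = labels.groupby(id_col)[label_col].nunique()
    conflicting = n_labels[n_labels > 1].index

    clean_labels = (
        labels[~labels[id_col].isin(conflicting)]
        .drop_duplicates(subset=id_col)  # collapse harmless repeated rows
        .reset_index(drop=True)
    )
    clean_telematics = telematics[~telematics[id_col].isin(conflicting)].reset_index(drop=True)
    return clean_labels, clean_telematics, list(conflicting)
```

Returning the list of removed IDs makes it easy to report how many trips were dropped.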

Step 3 Exploratory Data Analysis

Speed measured by GPS

Looking at the plot of speed for safe and unsafe trips, we can't really tell the two groups apart.

We'll see whether acceleration derived from speed shows a clearer difference.

Deriving acceleration from velocity (speed)

\begin{equation} a = \frac{{\Delta \upsilon }}{{\Delta t}} = \frac{{\upsilon_{f} - \upsilon_{i}}}{{t_{f}-t_{i}}} \end{equation}
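The formula above can be applied per trip as a finite difference between consecutive readings. The sketch below assumes columns named `bookingID`, `second` (elapsed time in seconds) and `Speed` (in m/s); these names and units are assumptions, as is the function name.

```python
import pandas as pd


def add_acceleration(df, id_col="bookingID", time_col="second", speed_col="Speed"):
    """Add an `acceleration` column: change in speed over change in time, per trip."""
    out = df.sort_values([id_col, time_col]).copy()
    grouped = out.groupby(id_col)
    dv = grouped[speed_col].diff()  # v_f - v_i within each trip
    dt = grouped[time_col].diff()   # t_f - t_i within each trip
    # Duplicate timestamps give dt == 0; treat those as missing, not infinite.
    out["acceleration"] = dv / dt.where(dt > 0)
    return out
```

The first reading of every trip has no predecessor, so its acceleration is left as NaN rather than borrowed from the previous trip.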

Step 4 Feature Engineering and Selection

Abnormal findings

Step 5 Machine Learning and Modelling

We will use the following models: RandomForestClassifier, GradientBoostingClassifier, LogisticRegression, GaussianNB and XGBClassifier

First, define the training and test data.
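A rough sketch of the split and fit step, run on synthetic data from `make_classification` in place of the real trip features. The 80/20 stratified split, the random seeds and the AUC metric are assumptions, not settings confirmed by the notebook. XGBClassifier is left out only to keep the example free of extra dependencies; it follows the same `fit` / `predict_proba` interface.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for the engineered trip-level features.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=0
)

# Stratify so both splits keep the same share of dangerous trips.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GaussianNB": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive (dangerous) class on held-out data.
    scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

for name, auc in scores.items():
    print(f"{name}: AUC = {auc:.3f}")
```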

H2O method

Step 6 Classical feature importance

Standard classical feature importance

Explain predictions

Here we use the various SHAP implementations integrated into the models to explain the dataset (xxxx samples).

Visualize a single prediction

Visualize many predictions

To keep the browser happy we only visualize 1,000 individuals.

SHAP Summary Plot

Rather than use a typical feature importance bar chart, we use a density scatter plot of SHAP values for each feature to identify how much impact each feature has on the model output for individuals in the validation dataset. Features are sorted by the sum of the SHAP value magnitudes across all samples.
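The ordering rule described above (sort features by the sum of absolute SHAP values across samples) can be reproduced by hand. The sketch below assumes a SHAP matrix with one row per sample and one column per feature; the function name and the toy values and feature names are made up for illustration.

```python
import numpy as np


def rank_features_by_shap(shap_values, feature_names):
    """Return (feature, total |SHAP|) pairs, largest total first."""
    # Sum of absolute SHAP values over samples (rows) for each feature (column).
    magnitudes = np.abs(np.asarray(shap_values)).sum(axis=0)
    order = np.argsort(magnitudes)[::-1]
    return [(feature_names[i], float(magnitudes[i])) for i in order]


# Toy matrix: 2 samples x 3 hypothetical features.
toy_shap = [[0.1, -2.0, 0.5],
            [-0.2, 1.0, -0.5]]
ranking = rank_features_by_shap(toy_shap, ["accuracy_mean", "trip_duration", "speed_max"])
print([name for name, _ in ranking])
```

Taking absolute values first matters: a feature that pushes some predictions up and others down would otherwise cancel out and look unimportant.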

Trip duration - It is interesting to note that longer rides are associated with dangerous driving. This could possibly be explained by driver fatigue, harsh driving conditions, etc.

Speeding - We could infer that the longer the ride, the higher the chance of a dangerous encounter. A higher maximum speed may also indicate dangerous driving.

Acceleration - It is interesting to note that more acceleration can indicate a more dangerous driving style.

Accuracy - Low accuracy does not necessarily relate to dangerous driving. It could indicate that the driver takes certain higher-risk manoeuvres to meet service-quality time targets. This could be used as a dependent variable in other use cases to improve customer satisfaction.

SHAP Simple Dependence Plots

SHAP dependence plots show the effect of a single feature across the whole dataset. They plot a feature’s value vs. the SHAP value of that feature across many samples. SHAP dependence plots are similar to partial dependence plots, but account for the interaction effects present in the features, and are only defined in regions of the input space supported by data. The vertical dispersion of SHAP values at a single feature value is driven by interaction effects, and another feature is chosen for coloring to highlight possible interactions.